Rate-distortion theory
Rate-distortion theory (Shannon, 1959) characterizes the fundamental tradeoff between compression and fidelity. Given a source and a limited-capacity channel, it asks: what is the minimum number of bits (rate $R$) needed to represent the source such that the expected distortion stays below a threshold $D$?
The rate-distortion function is:

$$R(D) = \min_{p(\hat{x} \mid x):\ \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$$
Lower distortion (better fidelity) requires higher rate (more bits). The Lagrangian form minimizes:

$$\mathbb{E}[d(X, \hat{X})] + \beta\, I(X; \hat{X})$$
The two terms measure different things. $I(X; \hat{X})$ (mutual information) is the rate — how much knowing the compressed representation $\hat{X}$ tells you about the original $X$. If the encoder throws everything away ($\hat{X}$ independent of $X$), $I(X; \hat{X}) = 0$: zero bits, maximum distortion. In the lossless discrete limit, preserving everything gives $I(X; \hat{X}) = H(X)$: you need as many bits as the source has entropy. The mutual information tells you how many distinct codewords the encoding scheme effectively needs — more mutual information means finer distinctions preserved, more bits to transmit. It’s an aggregate property of the whole encoding scheme, not of any single sample.
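The two extremes can be checked numerically. The sketch below (function name and toy distributions are illustrative, not from the source) computes $I(X; \hat{X})$ in bits for a constant encoder and for a lossless identity encoder:

```python
import numpy as np

def mutual_information(p_x, p_xhat_given_x):
    """I(X; X_hat) in bits for a discrete source and stochastic encoder.

    p_x: shape (n,) source distribution.
    p_xhat_given_x: shape (n, m) encoder; each row sums to 1.
    """
    p_joint = p_x[:, None] * p_xhat_given_x   # p(x, xhat)
    p_xhat = p_joint.sum(axis=0)              # output marginal
    denom = p_x[:, None] * p_xhat[None, :]    # p(x) * p(xhat)
    mask = p_joint > 0                        # skip zero-probability cells
    return float((p_joint[mask] * np.log2(p_joint[mask] / denom[mask])).sum())

p_x = np.array([0.5, 0.25, 0.25])   # H(X) = 1.5 bits

# Encoder that ignores X: every input maps to the same symbol -> zero rate.
constant = np.array([[1.0, 0.0, 0.0]] * 3)
print(mutual_information(p_x, constant))   # 0.0

# Lossless identity encoder -> I(X; X_hat) = H(X) = 1.5 bits.
identity = np.eye(3)
print(mutual_information(p_x, identity))   # 1.5
```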
In contrast, $d(x, \hat{x})$ is a pointwise distortion function that scores a single reconstruction $\hat{x}$ against the original $x$ (e.g., squared error $d(x, \hat{x}) = (x - \hat{x})^2$).
Lower distortion requires higher rate (more bits to preserve finer distinctions), but this is a property of optimal encodings — a bad encoder can waste bits and still have high distortion. $R(D)$ traces the Pareto frontier. The Lagrangian objective minimizes both, with $\beta$ controlling the tradeoff.
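Points on this frontier can be computed with the Blahut–Arimoto algorithm. The sketch below is a standard textbook construction (variable names are mine, assuming numpy): for a binary symmetric source with Hamming distortion, the analytic answer is $R(D) = 1 - H_b(D)$, which the iteration should reproduce.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=500):
    """One (rate, distortion) point via Blahut-Arimoto.

    p_x: (n,) source distribution; dist: (n, m) distortion matrix;
    s: tradeoff parameter (larger s -> lower distortion, higher rate).
    """
    n, m = dist.shape
    q = np.full(m, 1.0 / m)                   # output marginal q(xhat)
    for _ in range(n_iter):
        w = q[None, :] * np.exp(-s * dist)    # unnormalized p(xhat|x)
        p_cond = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_cond                      # re-estimate the marginal
    joint = p_x[:, None] * p_cond
    mask = joint > 0
    rate = (joint[mask] * np.log2((p_cond / q[None, :])[mask])).sum()
    distortion = (joint * dist).sum()
    return float(rate), float(distortion)

# Binary symmetric source, Hamming distortion: R(D) = 1 - H_b(D).
p_x = np.array([0.5, 0.5])
dist = 1.0 - np.eye(2)
R, D = blahut_arimoto(p_x, dist, s=2.0)
Hb = -D * np.log2(D) - (1 - D) * np.log2(1 - D)
print(R, D, 1 - Hb)   # R matches 1 - H_b(D)
```

Sweeping `s` traces the whole $R(D)$ curve, one point per value of the tradeoff parameter.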
Related
Information bottleneck
The Information bottleneck method can be viewed as an RD-style objective over representations $Z$: minimize $I(X; Z)$ while maximizing the predictive information $I(Z; Y)$. The canonical objective is $\min_{p(z \mid x)} I(X; Z) - \beta\, I(Z; Y)$, and one common RD interpretation uses a KL-based distortion term (e.g., between $p(y \mid x)$ and $p(y \mid z)$) rather than reconstruction error.
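For discrete variables the IB Lagrangian can be evaluated directly. A minimal sketch (toy joint distribution and encoder names are hypothetical), using the Markov structure $Y - X - Z$ so that $p(z, y) = \sum_x p(x, y)\, q(z \mid x)$:

```python
import numpy as np

def mi(p_joint):
    """Mutual information in bits from a joint distribution table."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    denom = p_a * p_b
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / denom[mask])).sum())

def ib_objective(p_xy, q_z_given_x, beta):
    """IB Lagrangian I(X; Z) - beta * I(Z; Y) for a discrete encoder."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * q_z_given_x   # joint p(x, z)
    p_zy = q_z_given_x.T @ p_xy         # joint p(z, y) via the Markov chain
    return mi(p_xz) - beta * mi(p_zy)

# Toy joint p(x, y): X is a noisy indicator of Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

copy_x   = np.eye(2)                    # Z = X: all predictive info, full rate
constant = np.array([[1.0, 0.0]] * 2)   # Z ignores X: zero rate, zero prediction
print(ib_objective(p_xy, copy_x, 2.0), ib_objective(p_xy, constant, 2.0))
```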
Link to VAE
In deep learning, the $\beta$-Variational autoencoder embodies the rate-distortion tradeoff via the ELBO (Evidence Lower Bound):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The terms map onto the rate-distortion Lagrangian (with expectation over the data distribution):
- $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ is the negative distortion: how well can you reconstruct $x$ from the latent $z$?
- $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ is the rate: how much the encoder’s posterior deviates from the prior on average. If the encoder ignores the input and outputs the prior, KL = 0 (zero rate). If it encodes fine-grained distinctions, the KL is large. The expected KL, $\mathbb{E}_{p(x)}[D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))]$, is an upper bound on $I(X; Z)$.
So the ELBO has the same structure as $-(\text{distortion} + \beta \cdot \text{rate})$: maximize fidelity minus $\beta$ times rate. Increasing $\beta$ tightens the bottleneck — typically more compression and often better disentanglement, at the cost of worse reconstruction.
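The loss itself is short to write down. A minimal numpy sketch (names and toy data are illustrative), using squared-error distortion and the closed-form KL between a diagonal-Gaussian posterior $\mathcal{N}(\mu, \sigma^2)$ and a standard-normal prior:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta):
    """Negative ELBO with a beta-weighted rate term (mean over the batch).

    Distortion: squared-error reconstruction (unit-variance Gaussian likelihood).
    Rate: closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    """
    distortion = 0.5 * ((x - x_recon) ** 2).sum(axis=1)
    rate = 0.5 * (mu ** 2 + np.exp(logvar) - logvar - 1.0).sum(axis=1)
    return (distortion + beta * rate).mean()

# Hypothetical encoder outputs: batch of 4, data dim 3, latent dim 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
x_recon = x + 0.1 * rng.normal(size=(4, 3))
mu = rng.normal(scale=0.5, size=(4, 2))
logvar = np.full((4, 2), -1.0)

# A posterior that matches the prior exactly contributes zero rate.
zero_rate = beta_vae_loss(x, x, np.zeros((4, 2)), np.zeros((4, 2)), beta=4.0)
print(zero_rate)   # 0.0
print(beta_vae_loss(x, x_recon, mu, logvar, beta=4.0))
```

Raising `beta` increases the penalty on any posterior that deviates from the prior, which is exactly the tightened bottleneck described above.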
See also
- DAmato2025number: applies RDT to explain the emergence of number sense in $\beta$-VAEs